We introduce multi-modal, attention-based neural machine translation (NMT) models which incorporate visual features into different parts of both the encoder and the decoder. We utilise global image features extracted using a pre-trained convolutional neural network and incorporate them (i) as words in the source sentence, (ii) to initialise the encoder hidden state, and (iii) as additional data to initialise the decoder hidden state. In our experiments, we evaluate how these different strategies to incorporate global image features compare and which ones perform best. We also study the impact of adding synthetic multi-modal, multilingual data and find that the additional data have a positive effect on multi-modal models. We report new state-of-the-art results, and our best models also significantly improve on a comparable phrase-based Statistical MT (PBSMT) model trained on the Multi30k data set according to all metrics evaluated. To the best of our knowledge, this is the first time a purely neural model significantly improves over a PBSMT model on all metrics evaluated on this data set.
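As a rough illustration of strategy (iii), the sketch below shows one way a global image feature vector from a pre-trained CNN could be projected, together with a summary of the encoder states, into the decoder's initial hidden state. This is a minimal sketch and not the authors' implementation: the class and parameter names (ImageInitDecoder, img_feat_dim, etc.) are illustrative assumptions, and the attention mechanism used in the actual models is omitted for brevity.

```python
# Minimal sketch (assumed names, not the paper's code): initialise the decoder
# hidden state from a global CNN image feature plus the mean encoder state.
import torch
import torch.nn as nn

class ImageInitDecoder(nn.Module):
    def __init__(self, img_feat_dim=4096, enc_dim=512, dec_dim=512,
                 vocab_size=10000, emb_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, dec_dim, batch_first=True)
        # Project [image feature ; source summary] into the decoder state space.
        self.init_proj = nn.Linear(img_feat_dim + enc_dim, dec_dim)

    def forward(self, tgt_tokens, enc_states, img_feats):
        # enc_states: (batch, src_len, enc_dim); img_feats: (batch, img_feat_dim)
        enc_mean = enc_states.mean(dim=1)  # summarise the source sentence
        h0 = torch.tanh(self.init_proj(torch.cat([img_feats, enc_mean], dim=-1)))
        out, _ = self.gru(self.embedding(tgt_tokens), h0.unsqueeze(0))
        return out

# Usage with random tensors of plausible shapes
dec = ImageInitDecoder()
out = dec(torch.randint(0, 10000, (2, 7)),   # target token ids
          torch.randn(2, 15, 512),           # encoder hidden states
          torch.randn(2, 4096))              # global image features
print(out.shape)  # torch.Size([2, 7, 512])
```

Strategies (i) and (ii) would differ mainly in where the projected image feature enters the network (as an extra source-side embedding, or as the encoder's initial state) rather than in the projection itself.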